In [1]:
%load_ext autoreload
%autoreload 2
import mpld3
mpld3.enable_notebook()

from package.cc import ChemicalChecker
import os

os.environ['CC_CONFIG'] = 'config.json'
cc_local = ChemicalChecker()

We will start by creating the space objects that will help us connect with the data to create the visualizations:

In [2]:
# Mechanism of Action (B1)
MoA = cc_local.get_signature('char4', 'full', 'B1.001')
# Therapeutic Areas (E1)
ATC = cc_local.get_signature('char4', 'full', 'E1.001')
# Side effects (E3)
side_effects = cc_local.get_signature('char4', 'full', 'E3.001')

These objects let us explore the data in more depth. Following the naming used in packages such as sklearn, they have a 'fit' method that trains the instance and generates all the data needed for the visualizations. Unfortunately, this step is data-intensive and computationally very expensive, as it involves performing a Fisher's exact test for every feature of every molecule. In a big space such as B4 this means roughly 2.9 billion tests (631027 molecules by 4635 features), not to mention the memory needed to store the results. For these reasons, the code was designed ad hoc to run in our HPC facilities.

Nonetheless, it is possible to generate the visualizations using preprocessed data. To generate molecule visualizations, we just need to call the 'predict' method with a query molecule. The query can be given as an InChIKey, a SMILES string or a molecule name. Here we will analyse Atenolol, a beta-blocking agent used to treat high blood pressure and heart-associated chest pain.
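As a rough illustration of why the 'fit' step described above is so costly, the sketch below computes the number of tests for B4 and one possible form of a single enrichment test. It is not the package's implementation: the function name `enrichment_pvalue`, the one-sided (over-representation) alternative and the way the neighbourhood is counted are all assumptions made for this example.

```python
from math import comb

# Test count for the B4 space, using the figures quoted in the text.
n_molecules, n_features = 631_027, 4_635
n_tests = n_molecules * n_features  # one test per (molecule, feature) pair


def enrichment_pvalue(k, n, K, N):
    """One-sided Fisher's exact test via the hypergeometric upper tail.

    Hypothetical helper, not part of the package.
    k: neighbours of the molecule that carry the feature
    n: size of the neighbourhood
    K: molecules in the whole space that carry the feature
    N: total number of molecules in the space
    Returns P(X >= k) when n molecules are drawn at random from N.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy example: all 3 neighbours carry a feature present in 3 of 10 molecules.
p = enrichment_pvalue(k=3, n=3, K=3, N=10)
print(n_tests, p)
```

Each call is cheap on its own, but repeating it for every molecule and every feature, and storing every result, is what pushes the full computation onto HPC resources.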

In [3]:
%matplotlib inline

# Mechanism of action
_, df = MoA.predict('Atenolol')
df
RDKit ERROR: [00:32:12] SMILES Parse Error: syntax error while parsing: Atenolol
RDKit ERROR: [00:32:12] SMILES Parse Error: Failed parsing SMILES 'Atenolol' for input: 'Atenolol'
100%|██████████████████████████████████████| 1001/1001 [00:01<00:00, 781.04it/s]
Out[3]:
    Feature         Description                                      Score
0   P07550(1)       Favors Beta-2 adrenergic receptor                1.00
1   P08588(-1)      Against Beta-1 adrenergic receptor               1.00
2   Class:544(1)    Favors Adrenergic receptor                       1.00
3   P07550(-1)      Against Beta-2 adrenergic receptor               1.00
4   Class:1266(1)   Favors Monoamine receptor                        1.00
5   Class:544(-1)   Against Adrenergic receptor                      0.90
6   Class:1088(1)   Favors Small molecule receptor (family A GPCR)   0.88
7   Class:1020(1)   Favors Family A G protein-coupled receptor       0.65
8   Class:11(1)     Favors Membrane receptor                         0.57
9   P13945(1)       Favors Beta-3 adrenergic receptor                0.51
10  P08588(1)       Favors Beta-1 adrenergic receptor                0.50
11  Class:1266(-1)  Against Monoamine receptor                       0.35
12  Class:1088(-1)  Against Small molecule receptor (family A GPCR)  0.29
13  Class:0(1)      Favors Protein class                             0.23
14  P13945(-1)      Against Beta-3 adrenergic receptor               0.23
15  Class:1020(-1)  Against Family A G protein-coupled receptor      0.20
16  Class:11(-1)    Against Membrane receptor                        0.18

Calling this method returns a dataframe with all the features enriched within the molecule's neighbourhood, each with a confidence score: a p-value transformed so that scores range from 0 to 1. It also produces an interactive figure showing the areas where these features are positively enriched, together with the position of the molecule and its neighbours. The legend is interactive, so you can choose which areas to display. You can also use the controls below the figure to pan, zoom in and out, and return to the original view. As shown, the molecule lies in a region rich in beta-blocking agents like itself. We can also take a look at the therapeutic areas and side effects that are enriched within its neighbourhood.

In [4]:
# Therapeutic areas
_, df = ATC.predict('Atenolol')
df
100%|███████████████████████████████████████| 856/856 [00:00<00:00, 2836.19it/s]
Out[4]:
    Feature  Description                                  Score
0   C:C07A   BETA BLOCKING AGENTS                         1.00
1   B:C07    BETA BLOCKING AGENTS                         1.00
2   A:C      CARDIOVASCULAR SYSTEM                        0.95
3   D:C07AB  Beta blocking agents, selective              0.75
4   D:C07AA  Beta blocking agents, non-selective          0.64
5   C:C01C   CARDIAC STIMULANTS EXCL. CARDIAC GLYCOSIDES  0.37
6   C:C07B   BETA BLOCKING AGENTS AND THIAZIDES           0.31
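
The result tables can also be post-processed outside the figure. The sketch below works on a few rows copied from Out[3] and keeps only the positively enriched features above a score cut-off. The helper names `split_feature` and `favoured_targets` and the 0.5 threshold are hypothetical, and reading the `(1)` / `(-1)` suffix as the direction is an inference from the matching Favors / Against descriptions, not documented behaviour.

```python
import re

# A few rows copied from the Out[3] table: (feature, description, score).
rows = [
    ("P07550(1)", "Favors Beta-2 adrenergic receptor", 1.00),
    ("P08588(-1)", "Against Beta-1 adrenergic receptor", 1.00),
    ("Class:544(1)", "Favors Adrenergic receptor", 1.00),
    ("Class:544(-1)", "Against Adrenergic receptor", 0.90),
    ("P13945(1)", "Favors Beta-3 adrenergic receptor", 0.51),
    ("Class:0(1)", "Favors Protein class", 0.23),
]

# Feature labels look like "<identifier>(<1 or -1>)".
FEATURE_RE = re.compile(r"^(?P<target>.+)\((?P<sign>-?1)\)$")


def split_feature(label):
    """Split 'P07550(-1)' into ('P07550', -1). Hypothetical helper."""
    match = FEATURE_RE.match(label)
    if match is None:
        raise ValueError(f"unexpected feature label: {label!r}")
    return match.group("target"), int(match.group("sign"))


def favoured_targets(rows, min_score=0.5):
    """Keep features with a positive direction and score >= min_score."""
    kept = []
    for label, _description, score in rows:
        target, sign = split_feature(label)
        if sign == 1 and score >= min_score:
            kept.append((target, score))
    return kept


strong = favoured_targets(rows)
print(strong)
```

With the real dataframe, `df.itertuples(index=False)` should yield rows in this shape, assuming the columns are ordered Feature, Description, Score as displayed. ATC codes such as `C:C07A` have no direction suffix, so `split_feature` would raise on the Out[4] table and a different parser would be needed there.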